Toxicity Policies
Overview
Toxicity policies detect toxic content in user inputs or model responses. DynamoGuard’s default toxicity policy uses the following taxonomy (an illustrative sketch of a policy check appears after the list):
- The following behaviors and topics are considered ‘toxic’:
  - Violent or hurtful topics
  - Hateful, derogatory, or discriminatory speech
  - Sexual references, including but not limited to sexual innuendos, acts, devices, and favors
  - Illegal or criminal activities
  - Encouragement of, or methods for, self-harm or suicide
  - Harassment or abusive behavior
  - Sexist or racist discourse
  - Implicit or explicit discrimination or stereotyping
  - Dangerous, unethical, or harmful topics
  - Creation, procurement, or operation of weapons
  - Procurement or use of harmful substances
  - Toxic, abusive, or offensive behavior
  - Biased, unfair, or ignorant remarks
  - Untruthful or misleading statements
  - Malice or manipulation
  - Vulgar or offensive language
- The following behaviors and topics are not considered ‘toxic’:
  - General, harmless queries
  - Responsible information on violence and discrimination
  - Responsible sexual education, health, or consent
  - Factual resources for mental health
  - Resources for managing conflicts and reporting harassment
  - Promotion of diversity, fairness, and inclusion
  - Crime prevention
  - Responsible weapon ownership
  - Responsible, harmless, and safe information on substances
  - Explanations of ethical and responsible behavior
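Conceptually, the policy maps each piece of content to zero or more of the toxic categories above. The following is a minimal, self-contained Python sketch of that flow; the category names, the `ToxicityVerdict` type, and the keyword-based `classify_toxicity` placeholder are all illustrative assumptions, not DynamoGuard’s actual labels or API. A real deployment would call the DynamoGuard service instead of the placeholder classifier.

```python
from dataclasses import dataclass

# Hypothetical category labels mirroring the taxonomy above; DynamoGuard's
# internal label set is not documented here, so these names are assumptions.
TOXIC_CATEGORIES = [
    "violence",
    "hate speech",
    "sexual content",
    "illegal activity",
    "self harm",
    "harassment",
    "discrimination",
    "weapons",
    "harmful substances",
    "profanity",
]

@dataclass
class ToxicityVerdict:
    toxic: bool           # overall decision for the content
    categories: list[str] # which taxonomy categories were matched
    score: float          # classifier confidence, 0.0 to 1.0

def classify_toxicity(text: str) -> ToxicityVerdict:
    """Placeholder classifier standing in for the policy's actual model.

    This keyword check exists only so the example runs end to end; a real
    deployment would delegate this decision to the DynamoGuard service.
    """
    matches = [c for c in TOXIC_CATEGORIES if c in text.lower()]
    return ToxicityVerdict(
        toxic=bool(matches),
        categories=matches,
        score=1.0 if matches else 0.0,
    )

# Benign queries (e.g., factual mental-health resources) should pass.
print(classify_toxicity("Where can I find factual resources for mental health?"))
```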
Toxicity Policy Actions
Toxicity policies currently support two actions: flagging and blocking content. A sketch of how an application might apply each action follows the list.
- Flag: allow a user input or model output containing toxic content to pass through, but flag it for review in the moderator view
- Block: prevent a user input or model output containing toxic content from being delivered
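Below is a minimal sketch of how an application layer might enforce these two actions. The `PolicyAction` enum, the in-memory `moderation_queue`, and `apply_toxicity_action` are hypothetical names used for illustration; DynamoGuard’s own enforcement mechanism is not shown here.

```python
from enum import Enum

class PolicyAction(Enum):
    FLAG = "flag"    # allow the content through, but surface it to moderators
    BLOCK = "block"  # withhold the content entirely

# Hypothetical in-memory moderator queue; a real deployment would persist
# flagged items so they appear in the moderator view.
moderation_queue: list[dict] = []

def apply_toxicity_action(text: str, is_toxic: bool, action: PolicyAction) -> str | None:
    """Return the text to deliver, or None when the policy blocks it."""
    if not is_toxic:
        return text
    if action is PolicyAction.FLAG:
        moderation_queue.append({"text": text})
        return text  # delivered, but flagged for moderator review
    return None      # PolicyAction.BLOCK: content is withheld

# A flagged input still reaches its destination; a blocked one does not.
assert apply_toxicity_action("hello", False, PolicyAction.BLOCK) == "hello"
assert apply_toxicity_action("toxic text", True, PolicyAction.FLAG) == "toxic text"
assert apply_toxicity_action("toxic text", True, PolicyAction.BLOCK) is None
```

The design point this illustrates is that Flag is non-blocking: flagged content still reaches its destination, and the record in the moderator view enables after-the-fact review, while Block is the only action that interrupts delivery.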